Motivation, context, history, related topics ...
Terms
Linearly separable means that a line (or, in higher dimensions, a hyperplane) can be drawn that completely separates the observations of one class from the observations of the other.
Consider this dataset and new observation.
Score observations based on their distances from the boundary
To predict a new observation's class, find its score in the same way and make the prediction based on its score sign.
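The score-by-sign rule can be sketched as follows. The boundary location (x = 10) and the class labels A and B are hypothetical stand-ins, not taken from the dataset above.

```python
# A minimal sketch, assuming a hypothetical 1D boundary at x = 10.
def score(x, boundary=10.0):
    """Signed distance of observation x from the boundary."""
    return x - boundary

def predict(x, boundary=10.0):
    """Predict class by score sign: negative -> A, positive -> B."""
    return "A" if score(x, boundary) < 0 else "B"

print(score(7.0))                  # -3.0: observation sits left of the boundary
print(predict(7.0), predict(14.0))
```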
Alternatively, score observations in a different way, interpret the scores as probabilities, and predict based on probability.
(Various implementations of the support vector machine method employ various schemes to determine probabilities - and may even incorporate some randomization due to cross-validation.)
Note that the sigmoid function rescales any value into the range 0 to 1. The sigmoid function is also known as the logistic function.
sigmoid: $\large \frac{1}{1+e^{-x}}$
sigmoid range: $0 < sigmoid(x) < 1$, for any $x$
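A direct implementation of the sigmoid:

```python
import math

def sigmoid(x):
    """Logistic function: maps any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```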
First, score observations based on their distances from the boundary
Second, re-scale the scores so that observations on the margin edges have scores of magnitude 1. This step puts scores on a common, intuitive scale regardless of the margin's width.
Third, further re-scale the scores to the range 0 to 1. Use the sigmoid function to do this. This step provides for scores to be interpreted as probabilities.
To predict a new observation's class, find its score in the same way, interpret the score as a probability, and make a prediction based on the probability.
Use the probability with a cutoff to make a prediction.
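The three steps can be sketched end to end. The boundary (x = 10), the margin half-width (2), and the labels A and B are hypothetical values chosen for illustration.

```python
import math

# A sketch of the three steps, assuming a hypothetical boundary at x = 10
# with margin edges at x = 8 and x = 12 (margin half-width 2).
BOUNDARY, HALF_WIDTH = 10.0, 2.0

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def probability(x):
    raw = x - BOUNDARY         # step 1: signed distance from the boundary
    scaled = raw / HALF_WIDTH  # step 2: margin edges map to scores -1 and +1
    return sigmoid(scaled)     # step 3: rescale into (0, 1)

def predict(x, cutoff=0.5):
    return "B" if probability(x) >= cutoff else "A"

print(round(probability(12.0), 3))  # margin edge -> sigmoid(1), about 0.731
print(predict(7.0), predict(14.0))
```

An observation exactly on the boundary gets probability 0.5, so the default cutoff of 0.5 reproduces the score-sign rule.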
Note, these observations are not linearly separable.
Consider observation x1=5 and observation x1=20 as candidates for the support vectors. The gap would be 20-5=15, but observation x1=4 would be distance 16 on the wrong side of the x1=20 edge. If cost were chosen as 0.1, then the penalty would be 0.1×16=1.6, and so the gap with penalty would be 15-1.6=13.4. Similarly, if cost were chosen as 1 or 10, then the gap with penalty would be 15-(1×16)=-1 or 15-(10×16)=-145, respectively.
Consider observation x1=3 and observation x1=20 as candidates for the support vectors. The gap would be 20-3=17, but observation x1=4 would be distance 16 on the wrong side of the x1=20 edge, and observation x1=5 would be distance 2 on the wrong side of the x1=3 edge. If cost were chosen as 0.1, then the penalty would be 0.1×(16+2)=1.8, and so the gap with penalty would be 17-1.8=15.2. Similarly, if cost were chosen as 1 or 10, then the gap with penalty would be 17-(1×(16+2))=-1 or 17-(10×(16+2))=-163, respectively.
Note, the choice of cost influences the determination of the support vectors, which determine the boundary. In our example, at cost=0.1, observations x1=3 and x1=20 would be preferred over observations x1=5 and x1=20 as support vectors because 15.2 > 13.4. A new observation x1=12 would be classified by sign as B. At cost=10, the reverse would be preferred because -145 > -163, and a new observation x1=12 would be classified by sign as A.
The support vector machine method relies on an optimization algorithm to find the support vectors by maximizing the penalized space between edges.
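A sketch of the selection above. The candidate edges and the violation distances (16 for x1=4, 2 for x1=5) are taken from the worked example; a real solver searches many more candidates, but the scoring rule is the same.

```python
# Penalized-gap arithmetic from the worked example above.
def penalized_gap(low, high, violations, cost):
    """Gap between the candidate edges minus the cost-weighted penalty."""
    return (high - low) - cost * sum(violations)

# candidate support-vector edges -> distances of observations on the wrong side
candidates = {(5, 20): [16], (3, 20): [16, 2]}

def best_pair(cost):
    return max(candidates,
               key=lambda pair: penalized_gap(*pair, candidates[pair], cost))

print(round(penalized_gap(5, 20, [16], 0.1), 1))     # 13.4
print(round(penalized_gap(3, 20, [16, 2], 0.1), 1))  # 15.2
print(best_pair(0.1))  # (3, 20): 15.2 beats 13.4
print(best_pair(10))   # (5, 20): -145 beats -163
```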
Introduce a new variable by applying a kernel function to the original dataset. This effectively increases the dimensionality of the dataset. Here, we apply the kernel function $y = (x-12)^2$.
Score a new observation based on its distance from the boundary and predict the new observation's class based on its score sign.
Alternatively, assign probabilities to observations based on their scores, and predict a new observation's class based on its probability.
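A sketch of this transformation with hypothetical data: in the original x, class B surrounds x = 12 and class A lies far from it, so no single threshold on x separates the classes, but the new feature y = (x-12)^2 makes them separable by one threshold. The dataset and the boundary value (y = 20) are illustrative assumptions.

```python
# Hypothetical 1D data: class B clusters near x = 12, class A lies far away.
xs = [3, 5, 10, 11, 13, 14, 20, 22]
labels = ["A", "A", "B", "B", "B", "B", "A", "A"]

def kernel_feature(x):
    """The new variable y = (x - 12)^2."""
    return (x - 12) ** 2

ys = [kernel_feature(x) for x in xs]
print(list(zip(labels, ys)))  # class B gets small y, class A gets large y

# With a hypothetical boundary at y = 20, the sign of y - 20 separates the classes.
def predict(x, boundary=20):
    return "B" if kernel_feature(x) - boundary < 0 else "A"

print([predict(x) for x in xs])  # matches labels
```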
Introduce a new variable by applying a kernel function to the original dataset. This effectively increases the dimensionality of the dataset. Here, we apply the kernel function $y = -1.8 + 3.0x -0.4x^2 +0.1x^3$.
Score a new observation based on its distance from the boundary and predict the new observation's class based on its score sign.
Alternatively, assign probabilities to observations based on their scores, and predict a new observation's class based on its probability.
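The cubic kernel above can be evaluated with Horner's rule. The original dataset and boundary are not shown here, so only the feature computation is sketched; as in the previous example, classification then thresholds the new feature y.

```python
def kernel_feature(x):
    """y = -1.8 + 3.0x - 0.4x^2 + 0.1x^3, evaluated with Horner's rule."""
    return -1.8 + x * (3.0 + x * (-0.4 + x * 0.1))

print(kernel_feature(0))           # -1.8
print(round(kernel_feature(2), 1)) # 3.4
```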
Introduce a new variable by applying a kernel function to the original dataset. This effectively increases the dimensionality of the dataset. Here, we apply the kernel function $z = (x1-4)^2 + (x2-2)^2$.
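A sketch with hypothetical 2D data: class B clusters around the point (4, 2) and class A lies farther out. The new feature z is the squared distance from (4, 2), so a single threshold on z (a circle in the original space) separates the classes. The dataset and the boundary value (z = 10) are illustrative assumptions.

```python
# Hypothetical 2D data: class B near (4, 2), class A farther away.
points = [(4, 2), (5, 3), (3, 1), (0, 0), (8, 6), (0, 5)]
labels = ["B", "B", "B", "A", "A", "A"]

def kernel_feature(x1, x2):
    """The new variable z = (x1 - 4)^2 + (x2 - 2)^2."""
    return (x1 - 4) ** 2 + (x2 - 2) ** 2

zs = [kernel_feature(x1, x2) for x1, x2 in points]
print(zs)  # class B gets small z, class A gets large z

# Hypothetical boundary at z = 10: small z -> class B, large z -> class A.
def predict(x1, x2, boundary=10):
    return "B" if kernel_feature(x1, x2) < boundary else "A"

print([predict(x1, x2) for x1, x2 in points])  # matches labels
```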
Copyright (c) Berkeley Data Analytics Group, LLC. Document revised October 17, 2019.